
Authors

Graham Cormode

S. Muthukrishnan

Abstract

Data streams often consist of multiple signals. Consider a stream of multiple signals $(i, a_{i,j})$, where the $i$'s correspond to the domain, the $j$'s index the different signals, and $a_{i,j} \ge 0$ is the value of the $j$th signal at point $i$. We study the problem of determining the dominance norms over the multiple signals, in particular the max-dominance norm, defined as $\sum_i \max_j \{a_{i,j}\}$. It is used in applications to estimate the "worst case influence" of multiple processes, for example in IP traffic analysis, electrical grid monitoring and the financial domain. Besides finding many applications, it is a natural measure: it generalizes the notion of union of data streams and may alternatively be thought of as estimating the $L_1$ norm of the upper envelope of multiple signals. We present the first known data stream algorithm for estimating the max-dominance of multiple signals. In particular, we use workspace and time-per-item that are both sublinear (in fact, poly-logarithmic) in the input size. The algorithm is simple and implementable; its analysis relies on properties of stable random distributions with small parameter $\alpha$, which may be a technique of independent interest. In contrast, we show that other dominance norms, namely min-dominance ($\sum_i \min_j \{a_{i,j}\}$), count-dominance ($|\{i \mid a_i > b_i\}|$) and relative-dominance ($\sum_i a_i / \max\{1, b_i\}$), are all impossible to estimate accurately with sublinear space.

1 Introduction

There is growing interest in data streams where data is generated rapidly over time in massive amounts; each data item must be processed quickly as it is generated, and we have only a small amount of memory (workspace), which precludes us from storing all the data items we see.
The algorithms community has formalized models for designing and evaluating data stream algorithms wherein the workspace used is poly-logarithmic in the various parameters of the stream, and the time to process each item or to evaluate desired functions of the input at any time is also poly-logarithmic in the stream parameters [HRR98, GM99, FKSV99, GKMS01]. There is a growing body of data stream algorithms: see the excellent survey in [Mot02].

One of the applied areas that has strong interest in these results is the networking community, where traffic logs generated from routers need to be analyzed for network management tasks. A promising approach (perhaps the only scalable approach) to deal with the massive data generated by the network logs is to use data stream techniques. Another applied area is databases, where data typically tends to be very large and one-pass algorithms are often desirable. Again, data stream techniques fit naturally into this paradigm. Therefore, algorithmic results for computing various norms, wavelet coefficients, histograms and clustering are finding applications in networking and databases (see [Mot02] for details and references). Data streams also arise in a surprising number of other areas, such as atmospheric monitoring [NOA], web activity "clickstreams", and financial analysis.

The starting point of our investigation here is the observation that data streams are often not individual signals, but comprise multiple signals presented in an interspersed manner. For example, web click streams may be an arbitrary ordering of clicks by different customers at different web servers of a server farm; financial events may be stock activity from multiple customers on stocks from many different sectors and indices; and IP traffic logs may be logs at management stations of the cumulative traffic at different time periods from multiple router links.
Our focus here, from a conceptual point of view, is what we can measure (and monitor, in applications) about the set of all distributions or signals we see in the data stream. Previous work on measuring individual data stream signals has been extensive: estimating frequency moments [AMS96], wavelet coefficients [GKMS01], histograms [GKS01, TIGK02], and other problems. For the case of multiple signals, prior work has typically focused on estimating pairwise distances such as $L_p$ norms [FKSV99, Ind00], which can be used to cluster multiple signals or do proximity searches if we can store space proportional to the number of signals. Our motivation here is the study of data stream problems where a large number of signals are present. In particular, we focus on measuring and monitoring the cumulative trends in the presence of multiple signals.

In order to formalize our discussion, we let the data stream be a series of items $(i, a_{i,j})$ presented in some arbitrary order; the $i$'s correspond to the domain of the distributions (assumed to be identical without loss of generality), the $j$'s to the different distributions, and $a_{i,j}$ is the value of distribution $j$ at location $i$ (we will assume $a_{i,j} \ge 0$ for discussions here). We wish to measure and monitor the set of all the distributions, and focus on calculating an example of what we call dominance norms, particularly the max-dominance, defined to be $\sum_i \max_j \{a_{i,j}\}$. Intuitively, this corresponds to computing the $L_1$ norm of the upper envelope of the distributions, illustrated schematically in Figure 1.

Figure 1: A mixture of distributions (left), and their upper envelope (right)

Computing the max-dominance norm of multiple data distributions is interesting for many important reasons. First, applications abound where this measure is suitable for estimating the "worst case influence" under multiple distributions.
For example, in the IP network scenario, the $i$'s correspond to source IP addresses and $a_{i,j}$ corresponds to the number of packets sent by IP address $i$ in the $j$th transmission. Here the max-dominance measures the maximum possible utilization of the network if the transmissions from different source IP addresses were coordinated. (Similar analysis of network capacity using max-dominance is relevant in the electrical grid [Ele] and in other instances with IP networks, such as using SNMP logs [Rou02, LSC].) The concept of max-dominance also occurs in financial applications, where the maximum dollar index (MDI) for securities class action filings characterizes the intensity of litigation activity through time [Sec].

In addition to finding specific applications, the max-dominance norm has intrinsic conceptual interest. For example, if the $a_{i,j}$'s were 0/1, then this norm reduces to calculating the size of the union of the multiple sets. Therefore the max-dominance norm is a generalization of the standard union operation.

Furthermore, there are other dominance norms with multiple distributions. Closely related to max-dominance is min-dominance, $\sum_i \min_j \{|a_{i,j}|\}$, or more generally $\sum_i \mathrm{quantiles}_j \{|a_{i,j}|\}$. Closely related to various orderings (not just quantiles) of values are relative dominance norms. Relative count dominance is based on counting [1] the number of places where one distribution dominates another, $|\{i \mid a_i > b_i\}|$ for two given data distributions $a$ and $b$, and relative sum dominance is $\sum_i a_i / \max\{1, b_i\}$ [2]. All of these dominance norms are very natural for collating information from two or more data distributions.

Footnote 1: The relative counting dominance norm has previously arisen in the string matching context [AF91]. However, the focus there was not on streaming.

Footnote 2: The problem of estimating the relative norm $\sum_i a_i / \max\{1, b_i\}$ has been proposed as an open problem in [CCFC02] and in a talk by Monika Henzinger at the AMS-SIAM Joint meeting in January 2002.
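To make the definitions of the other dominance norms concrete, the following is a minimal sketch that evaluates them exactly on tiny inputs. It stores every index explicitly, so it uses linear space and is not a streaming algorithm; the function names and the sample data are illustrative assumptions, not part of the paper.

```python
def min_dominance(stream):
    """Sum over indices i of the smallest value seen for index i."""
    smallest = {}
    for i, value in stream:
        smallest[i] = min(value, smallest.get(i, value))
    return sum(smallest.values())


def relative_count_dominance(a, b):
    """Number of positions i where a_i strictly exceeds b_i."""
    return sum(1 for ai, bi in zip(a, b) if ai > bi)


def relative_sum_dominance(a, b):
    """Sum over positions i of a_i / max(1, b_i)."""
    return sum(ai / max(1, bi) for ai, bi in zip(a, b))


# Hypothetical sample data: a stream of (index, value) tuples and two vectors.
stream = [(1, 2), (3, 3), (3, 5)]
a = [4, 0, 6, 1]
b = [2, 3, 6, 0]

print(min_dominance(stream))           # minimum for index 1 is 2, for index 3 is 3
print(relative_count_dominance(a, b))  # a exceeds b at positions 0 and 3
print(relative_sum_dominance(a, b))    # 4/2 + 0/3 + 6/6 + 1/1
```

Each function is a direct transcription of the corresponding formula, which makes it usable as a reference oracle when testing an approximate method on small inputs.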
1.1 Our Results

Let us first elaborate on the challenge in computing dominance norms of multiple data streams by focusing on the max-dominance norm. If we had space proportional to the range of values $i$, then for each $i$ we could store $\max_j \{|a_{i,j}|\}$ for all the $a_{i,j}$'s seen thus far, and incrementally maintain $\sum_i \max_j \{|a_{i,j}|\}$. However, in our motivating scenarios, the input arrives as a stream, and we need to work with sublinear space (poly-logarithmic in the range of the $i$'s, the size of the stream, $n$, and in $M$, the range of the $j$'s) and take very little time (also poly-logarithmic in these quantities) for processing per item as well as for estimating the dominance norm. In this data stream model, algorithms for computing max-dominance norms are no longer obvious.

Surprisingly, no data stream algorithms are known for estimating any of the dominance norms. Much of the recent flurry of results in data streams has focused on using various $L_p$ norms to compare and collate information from different data streams, for example, $(\sum_i (\sum_j a_{i,j})^p)^{1/p}$ for $0 < p \le 2$ [AMS96, FKSV99, Ind00, CIKM02], and related notions such as Hamming norms $\sum_i ((\sum_j a_{i,j}) \ne 0)$ [CDIM02]. While these norms are suitable for capturing comparative trends in multiple data streams, they are not applicable for computing the various dominance norms (max, min, count or relative), which are just as important in scenarios such as analyzing financial, web click, and network monitoring applications.

We initiate the study of estimating dominance norms in data streams. Our main result in Section 3 is an algorithm that estimates max-dominance to a $1 + \epsilon$ approximation with probability at least $1 - \delta$ using space $O(\frac{1}{\epsilon^2}(\log M + \frac{1}{\epsilon} \log n) \log \frac{1}{\delta})$ and taking time $O(\frac{1}{\epsilon^4} \log a_{i,j} \log n \log \frac{1}{\delta})$ per item. In other words, both the space used and the time taken per item or for estimating the max-dominance are only poly-logarithmic in the input parameters.
Our approach is to show that random variables whose combination is precisely the max-dominance that we require can be simulated using stable probability distributions with extremely small stability parameter $\alpha > 0$. We take inspiration from the pioneering work of [Ind00], which introduced the use of stable distributions to data stream algorithms; however, the use of stable distributions with tiny $\alpha$ in this manner is novel. A technical crux of the work is a detailed analysis of stable distributions with tiny $\alpha$, which may be of independent interest. The remainder of our work on max-dominance attends to important practical details. How can these random variables be generated in small space? (We use small-space pseudo-random generators, as is customary in data stream algorithmics, but generalize them for our random variables.) How can the algorithm be sped up so that we do not take time $O(a_{i,j})$ per item, which is prohibitive? (We introduce a way to group random variables and perform the desired computations in large chunks, reducing the overall time to $O(\log a_{i,j})$ in the process.) Throughout we have made an effort to make the algorithm simple, and hence implementable. No such result was known for estimating any dominance norms in the data stream model.

In contrast to this result, in Section 4 we prove that the other dominance norms discussed (min-dominance, relative count dominance and relative sum dominance) need linear space to be approximated in the streaming model. These proofs use reductions from other problems known to be hard in the streaming and communication complexity models. Finally, in Section 5 we make some concluding remarks and related comments.

2 Preliminaries

2.1 Data Streams and Dominating Norms

As the idea of computations on the data stream has risen in popularity, a number of different formalisations have been provided by different authors, depending on the nature of the problem they are describing. We give a brief summary of the notation we will be using to handle data streams.
A data stream is a sequence of tuples of the form $(i, a_{i,j})$, meaning that the value $a_{i,j}$ is to be counted towards the $i$'th entry of an implicit vector $a$. The vector $a$ is initially zero, and the tuples arrive in some order that is beyond our control (we may as well assume they are adversarially ordered). In particular, there is no relation between the order of arrival and the index parameter $i$. For convenience of notation, we use the index $j$ to indicate that $a_{i,j}$ is the $j$th tuple with index $i$. However, $j$ is not made explicit in the stream and is in general not available for a stream algorithm to use. Similarly, $n_i$ is the number of tuples for index $i$ seen so far, and $n = \sum_i n_i$ is the total number of tuples seen in the data stream.

In this paper, we will discuss streams where each $a_{i,j}$ is an integer (usually interpreted as a non-negative integer) bounded by a quantity $M$. It may be convenient to assume that $M$ is bounded by some polynomial in $n$. For an algorithm that works in the data stream model to be of interest, the working space and per-item processing time must both be sublinear in $n$ and $M$, and ideally poly-logarithmic in these quantities.

The max-dominance of such a stream $a$ is defined as

$$\mathrm{dommax}(a) = \sum_i \max_{1 \le l \le n_i} \{a_{i,l}\}$$

Equivalently, we define the $i$'th entry of the implicit state vector as $\max_{1 \le l \le n_i} \{a_{i,l}\}$, and the dommax function is the $L_1$ norm of this vector. Note that it is a norm, as $\mathrm{dommax}(0) = 0$, $\mathrm{dommax}(ka) = k\,\mathrm{dommax}(a)$ and $\mathrm{dommax}(a + b) \le \mathrm{dommax}(a) + \mathrm{dommax}(b)$. We will also discuss other quantities such as min-dominance, relative norms, and quantile dominance, although these are not in general norms.

2.2 Stable Distributions

Indyk pioneered the use of stable distributions in data streams, and since then they have received a great deal of attention [Ind00, CIKM02, CDIM02, GGI+02]. They have been used to generalize the Johnson-Lindenstrauss lemma [JL84] (which approximates the Euclidean distance) to all $L_p$ distances with $0 < p \le 2$ [Ind00].
A stable distribution is defined by four parameters. These are (i) the stability index, $0 < \alpha \le 2$; (ii) the skewness parameter, $-1 \le \beta \le 1$; (iii) the scale parameter, $\gamma > 0$; and (iv) the location parameter, $\delta$. Throughout we shall deal with a canonical representation of stable distributions, where $\gamma = 1$ and $\delta = 0$. Therefore, we consider stable distributions $S(\alpha, \beta)$ so that, given $\alpha$ and $\beta$, the distribution is uniquely defined by these parameters. We write $X \sim S(\alpha, \beta)$ to denote that the random variable $X$ is distributed as a stable distribution with parameters $\alpha$ and $\beta$. When $\beta = 0$, as we will often find, the distribution is symmetric about the mean, and is called strictly stable.

Definition 2.1 The strictly stable distribution $S(\alpha, 0)$ is defined by the property

$$X \sim S(\alpha, 0),\ Y \sim S(\alpha, 0),\ Z \sim S(\alpha, 0) \implies aX + bY \sim cZ, \quad a^{\alpha} + b^{\alpha} = c^{\alpha}$$

That is, if $X$ and $Y$ are distributed with stability parameter $\alpha$, then any linear combination of them is also distributed as a stable distribution with the same stability parameter $\alpha$. The result is scaled by the scalar $c$ where $c = |a^{\alpha} + b^{\alpha}|^{1/\alpha}$.

The definition uniquely defines a distribution, up to scaling and shifting. By centring the distribution on zero and fixing a scale, we can talk about the strictly stable distribution with index $\alpha$. From the definition it follows that

$$X_1, \ldots, X_n \sim S(\alpha, 0),\ a = (a_1, \ldots, a_n),\ \|a\|_{\alpha} = \Big(\sum_i |a_i|^{\alpha}\Big)^{1/\alpha} \implies \sum_i a_i X_i \sim \|a\|_{\alpha}\, S(\alpha, 0)$$

3 Approximating the Max-Dominance Norm of a Data Stream

Recall that we wish to compute the sum of the maximum values seen in the stream. That is, we want to find

$$\mathrm{dommax}(a) = \sum_i \max\{a_{i,1}, a_{i,2}, \ldots, a_{i,n_i}\}$$

If each index $i$ appears at most once, then this is trivial to compute exactly, since it is just $\sum_i a_{i,1}$. Similarly, if each index appears at most $c$ times, then the quantity $\sum_{i,j} a_{i,j}$ is a $c$-approximation to the max-dominance of the stream. However, in general we have no such guarantee nor any reason to expect one, and so we must assume that each index may be represented many times over.
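As a point of reference for the streaming algorithm, the exact max-dominance can be computed with one stored value per distinct index. The sketch below does this and also evaluates the naive sum mentioned above; it uses space linear in the number of indices, so it only serves as a baseline. The function names are illustrative assumptions.

```python
def dom_max_exact(stream):
    """Exact max-dominance: keep the largest value seen for every index."""
    largest = {}
    for i, value in stream:
        if value > largest.get(i, 0):
            largest[i] = value
    return sum(largest.values())


def naive_sum(stream):
    """Sum of all values; an overestimate when indices repeat."""
    return sum(value for _, value in stream)


# Hypothetical stream of (index, value) tuples with non-negative values.
stream = [(1, 2), (3, 3), (3, 5)]
exact = dom_max_exact(stream)   # 2 for index 1, 5 for index 3
total = naive_sum(stream)       # 2 + 3 + 5

# Each index appears at most twice here, so the naive sum is at most
# twice the max-dominance.
print(exact, total, exact <= total <= 2 * exact)
```

The dictionary grows with the number of distinct indices, which is exactly the cost that the sketch-based approach is designed to avoid.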
We will show how the max-dominance can be found approximately by using values drawn from stable distributions. This allows us to state our main theorem:

Theorem 3.1 It is possible to compute an approximation to $\sum_i \max_{1 \le j \le n_i} \{a_{i,j}\}$ in the streaming model that is correct within a factor of $(1 + \epsilon)$ with probability $1 - \delta$ using space $O(\frac{1}{\epsilon^2}(\log M + \frac{1}{\epsilon} \log n) \log \frac{1}{\delta})$ and taking time $O(\frac{1}{\epsilon^4} \log a_{i,j} \log n \log \frac{1}{\delta})$ per item.

3.1 Idealized Algorithm

We first give an outline algorithm, then go on to show how this algorithm can be applied in practice on the stream with small memory and time requirements. We imagine that we have access to a special indicator distribution $X$. This has the (impossible) property that for any positive integer $c$ (that is, $c > 0$), $E(cX) = 1$. From this it is possible to derive a solution to the problem of finding the max-dominance of a stream of values. We maintain a scalar $z$, initially zero. We create a set of $x_{i,k}$, each drawn from distributions $X_{i,k} \sim X$. For every $a_{i,j}$ in the stream we update $z$ as follows:

$$z \leftarrow z + \sum_{k=1}^{a_{i,j}} x_{i,k}$$

This maintains the property that the expectation of $z$ is $\sum_i \max_j \{a_{i,j}\}$, as required. This is a consequence of the "impossible" property of $X_{i,k}$ that it contributes only 1 to the expectation of $z$ no matter how many times it is added. For example, suppose our stream consists of $\{(i = 1, a_{1,1} = 2), (3, 3), (3, 5)\}$. Then $z$ is distributed as $X_{1,1} + X_{1,2} + 2X_{3,1} + 2X_{3,2} + 2X_{3,3} + X_{3,4} + X_{3,5}$. The expected value of $z$ is then the number of different terms, 7, which is the max-dominance that we require ($2 + 5$). The required accuracy can be achieved by keeping, in parallel, several different values of $z$ based on independent drawings of values for $x_{i,k}$.

There are a number of hurdles to overcome in order to turn this idea into a practical solution.

1. How to choose the distributions $X_{i,k}$? We shall see how appropriate use of stable distributions can achieve a good approximation to these indicator variables.

2. How to reduce space requirements?
The above algorithm requires repeated access to $x_{i,k}$ for many values of $i$ and $k$. We need to be able to provide this access without explicitly storing every $x_{i,k}$ that is used. We also need to show that the required accuracy can be achieved by carrying out only a small number of independent repetitions in parallel.

3. How to compute efficiently? We require fast per-item processing, that is, poly-logarithmic in the size of the stream and the size of the items in the stream. But the algorithm above requires adding $a_{i,j}$ different values to a counter in each step: time linear in the size of the data item (that is, exponential in the size of its binary representation). We show how to efficiently compute the necessary range sums while ensuring that the memory usage remains limited.

For each hurdle, we shall show how it can be cleared whilst working within the streaming model.

3.2 Our Algorithm

We will use stable distributions with small stability parameter $\alpha$ in order to approximate the indicator variable $X_{i,j}$. Stable distributions can be used to approximate the number of non-zero values in a vector. From the idealized algorithm above, if we think about the vector formed for each index $i$ as $a^{(i)}_k = |\{j \mid a_{i,j} \ge k\}|$, then the number of non-zero entries of $a^{(i)}$ is $\max_j a_{i,j}$. We shall also write $a$ for the vector formed by concatenating all such vectors for different $i$. This is an alternate representation of the stream, $a$. To approximate the max-dominance, we will maintain a sketch vector $z(a)$ which summarizes the stream $a$.

Definition 3.1 The sketch vector $z(a)$ has a number of entries $m$ to be specified shortly. We make use of a number of values $x_{i,k,l}$, each of which is drawn from $S(\alpha, 0)$, for $\alpha$ to be determined later. Initially $z$ is set to zero in every dimension.
Invariant. We maintain the property for each $l$ that $z_l = \sum_{i,j} \sum_{k=1}^{a_{i,j}} x_{i,k,l}$.

Update Procedure. On receiving each pair $(i, a_{i,j})$ in the stream, we maintain the invariant by updating $z$ as follows:

$$\forall\, 1 \le l \le m: \quad z_l \leftarrow z_l + \sum_{k=1}^{a_{i,j}} x_{i,k,l}$$

Output. Our approximation of the dominating norm is $\ln 2\,(\mathrm{median}_k |z_k|)^{\alpha}$.

At any point, it is possible to extract from the sketch $z(a)$ a good approximation of the sum of the maximum values.

Theorem 3.2 $(1 - \epsilon)\,\mathrm{dommax}(a) \le \ln 2\,(\mathrm{median}_k |z(a)_k|)^{\alpha} \le (1 + \epsilon)^2\,\mathrm{dommax}(a)$

Proof: From the defining property of stable distributions (Definition 2.1), we know by construction that each entry of $z$ is drawn from the distribution $\|a\|_{\alpha}\, S(\alpha, 0)$, where $a$ here denotes the expanded vector of Section 3.2. We know that we will add any $x_{i,k,l}$ to $z_l$ at most once for each tuple in the stream, so we have an upper bound $U = n$ on each entry of $a$. A simple observation is that for small enough $\alpha$ and an integer valued vector $a$, the quantity $\|a\|_{\alpha}^{\alpha}$ (the $L_{\alpha}$ norm raised to the power $\alpha$, which is just $\sum_i |a_i|^{\alpha}$) approximates the number of non-zero entries in the vector. Formally, if we set an upper bound $U$ so that $\forall i: |a_i| \le U$ and fix $0 < \alpha \le \epsilon / \log_2 U$, then

$$|\{i \mid a_i \ne 0\}| = \sum_{a_i \ne 0} 1 \le \sum_{a_i \ne 0} |a_i|^{\alpha} = \|a\|_{\alpha}^{\alpha} \le \sum_{a_i \ne 0} U^{\alpha} \le \exp(\epsilon \ln 2)\,|\{i \mid a_i \ne 0\}| \le (1 + \epsilon)\,|\{i \mid a_i \ne 0\}|$$

Using this, we choose $\alpha$ to be $\epsilon / \log_2 n$, since each value $i$ appears at most $n$ times within the stream, so $U = n$. This guarantees

$$\mathrm{dommax}(a) \le \|a\|_{\alpha}^{\alpha} \le (1 + \epsilon)\,\mathrm{dommax}(a)$$

Lemma 3.1 If $X \sim S(\alpha, \beta)$ then

$$\lim_{\alpha \to 0^+} \mathrm{median}(|cX|^{\alpha}) = |c|^{\alpha}\, \mathrm{median}(|X|^{\alpha}) = \frac{|c|^{\alpha}}{\ln 2}$$

Proof: Let $E$ be distributed with the exponential distribution with mean one. Then $\lim_{\alpha \to 0^+} |S(\alpha, \beta)|^{\alpha} = E^{-1}$ [Cre75]. The density of $E^{-1}$ is $f(x) = x^{-2} \exp(-1/x)$ for $x > 0$, and the cumulative distribution is

$$F(x) = \int_0^x f(t)\,dt = \exp(-1/x)$$

so $\mathrm{median}(E^{-1}) = F^{-1}(1/2) = 1/\ln 2$.

Consequently $\forall k: |z_k|^{\alpha} \sim \|a\|_{\alpha}^{\alpha}\, |X|^{\alpha}$ and $\mathrm{median}\,\big|\, \|a\|_{\alpha} X \big|^{\alpha} = \|a\|_{\alpha}^{\alpha} / \ln 2$.

It is noteworthy that this analysis of the scaling constant $\ln 2$ is the sharpest yet known for the behavior of stable distributions with small $\alpha$. Previously, several results presented estimates for the necessary scaling constant with only experimental evidence.
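The claim that $\|a\|_{\alpha}^{\alpha}$ counts non-zero entries up to a factor $1 + \epsilon$ can be checked directly on the small stream used in the idealized algorithm. The sketch below builds the expanded vector explicitly (one counter per pair $(i, k)$), which takes time and space linear in the values and is therefore only an illustration, not the streaming procedure. The function names and the choice $\epsilon = 0.1$ are assumptions made for the example.

```python
import math
from collections import Counter


def expanded_vector(stream):
    """For each pair (i, k), count the tuples (i, value) with value >= k."""
    counts = Counter()
    for i, value in stream:
        for k in range(1, value + 1):
            counts[(i, k)] += 1
    return counts


def alpha_norm_power(counts, alpha):
    """Sum of |entry|^alpha over the non-zero entries of the vector."""
    return sum(c ** alpha for c in counts.values() if c != 0)


stream = [(1, 2), (3, 3), (3, 5)]
counts = expanded_vector(stream)

n = len(stream)                    # no entry can exceed the stream length
epsilon = 0.1
alpha = epsilon / math.log2(n)     # the choice alpha = epsilon / log2(U), U = n

nonzero = len(counts)              # equals the max-dominance of the stream
norm = alpha_norm_power(counts, alpha)

# The sandwich: nonzero <= norm <= (1 + epsilon) * nonzero
print(nonzero, norm, (1 + epsilon) * nonzero)
```

Entries equal to 1 contribute exactly 1 to the sum, and an entry equal to 2 contributes $2^{\alpha}$, which is only slightly above 1 for small $\alpha$; this is why the sum stays close to the count of non-zero entries.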
We next make use of a standard sampling result:

Lemma 3.2 Let $X$ be a distribution with cumulative distribution function $F(x)$. If the derivative of the inverse of $F(x)$ is bounded by a constant around the median, then the median of $O(\frac{1}{\epsilon^2} \log \frac{1}{\delta})$ samples from $X$ is within a factor of $1 \pm \epsilon$ of $\mathrm{median}(X)$ with probability $1 - \delta$.

The derivative of the inverse distribution function is indeed bounded at the median, since $F^{-1}(r) = -1/\ln r$, and $(F^{-1})'(\frac{1}{2}) < 5$. Hence for a large enough constant $c$, by taking a vector $z$ with $m = \frac{c}{\epsilon^2} \log \frac{1}{\delta}$ entries, each based on an independent repetition of the above procedure, we can approximate the desired quantity, and so

$$(1 - \epsilon)\,\|a\|_{\alpha}^{\alpha} \le \ln 2\; \mathrm{median}_k |z_k|^{\alpha} \le (1 + \epsilon)\,\|a\|_{\alpha}^{\alpha}$$

with probability $1 - \delta$ by this Lemma.

Thus to find our approximation of the sum of the maximum values, we maintain the vector $z$ as the dot product of the underlying vector $a$ with the values drawn from stable distributions, $x_{i,k,l}$. When we take the absolute value of each entry of $z$ and find their median, the result raised to the power $\alpha$ and scaled by the factor of $\ln 2$ is the approximation of $\mathrm{dommax}(a)$.

3.3 Space Requirement

For the algorithm to be applicable in the streaming model, we need to ensure that the space requirements are minimal, and certainly sublinear in the size of the stream. Therefore, we cannot explicitly keep all the values we draw from stable distributions, yet we require the values to be the same each time the same entry is requested at different points in the algorithm. This problem can be solved by using pseudo-random generators: we do not store any $x_{i,k,l}$ explicitly; instead we create it as a pseudo-random function of $k$, $i$, $l$ and a small number of stored random bits whenever it is needed. We need a different set of random bits in order to generate each of the $m$ instantiations of the procedure. We therefore need only consider the space required to store the random bits, and to hold the vector $z$.
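The idea of recreating a value on demand instead of storing it can be sketched as follows. This example derives two reproducible uniform numbers from a stored seed and the triple $(i, k, l)$ by hashing. It is only a stand-in chosen for illustration: it is not the space-bounded generator cited in the analysis and carries none of its guarantees. The function names, the use of SHA-256, and the 52-bit truncation are all assumptions.

```python
import hashlib
import math
import struct


def uniform_pair(seed, i, k, l):
    """Return two reproducible floats in the open interval (0, 1).

    The same (seed, i, k, l) always yields the same pair, so nothing
    needs to be stored between requests for the same entry.
    """
    key = f"{seed}:{i}:{k}:{l}".encode()
    digest = hashlib.sha256(key).digest()
    first, second = struct.unpack(">QQ", digest[:16])
    # Keep 52 bits and add one half so the result is never exactly 0 or 1.
    u = ((first >> 12) + 0.5) / 2.0 ** 52
    v = ((second >> 12) + 0.5) / 2.0 ** 52
    return u, v


def angle_from_uniform(v):
    """Map a uniform value in (0, 1) to an angle in (-pi/2, pi/2)."""
    return math.pi * (v - 0.5)


u1, v1 = uniform_pair(42, 3, 5, 0)
u2, v2 = uniform_pair(42, 3, 5, 0)   # requested again later: identical values
theta = angle_from_uniform(v1)
print((u1, v1) == (u2, v2), theta)
```

A uniform value and an angle of this form are the two inputs needed by transforms that turn uniform randomness into stable variates, so one stored seed per repetition is enough to reproduce every $x_{i,k,l}$ in that repetition.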
It is known that although there is no closed form for stable distributions for general $\alpha$, it is possible to draw values from such distributions for arbitrary $\alpha$ by using a transform from two independent uniform random variables.

Lemma 3.3 (Due to Chambers, Mallows and Stuck [CMS76]) Let $U$ be a uniform random variable on $[0, 1]$ and $\Theta$ uniform on $[-\frac{\pi}{2}, \frac{\pi}{2}]$. Then

$$S(\alpha, 0) \sim \frac{\sin \alpha\Theta}{(\cos \Theta)^{1/\alpha}} \left( \frac{\cos (1 - \alpha)\Theta}{\ln U^{-1}} \right)^{\frac{1 - \alpha}{\alpha}}$$

We also make use of two other results on random variables from the literature (see for example [UZ99]), with which we will prove the space requirements for the algorithm.

Lemma 3.4
(i) $X \sim S(\alpha, 0),\ Y \sim S(\alpha, 1),\ Z \sim S(\alpha, 1) \implies X \sim 2^{-1/\alpha}(Y - Z)$
(ii) $X \sim S(\alpha, 1),\ \alpha \to 0^+ \implies f(x) = O(\alpha \exp(-x^{-\alpha})\, x^{-\alpha - 1}),\ x > 0$

Lemma 3.5 The space requirement of this algorithm is $O(\frac{1}{\epsilon^2}(\log M + \frac{1}{\epsilon} \log n) \log \frac{1}{\delta})$ bits.

Proof: For each repetition of the procedure, we require $O(\log n)$ random bits to instantiate the pseudo-random generators, as per [Nis92, Ind00]. We also need to consider the space used to represent each entry of $z$. We analyze the process at each step of the algorithm: a value $x$ is drawn (pseudo-randomly) from $S(\alpha, 0)$, and added to an entry in $z$. The number of bits needed to represent this quantity is $\log_2 |x|$. The cumulative distribution of the limit from Lemma 3.4 (ii) is

$$F_{\beta=1}(x) = \int_0^x \alpha \exp(-t^{-\alpha})\, t^{-\alpha - 1}\,dt = \exp(-x^{-\alpha})$$

and so

$$F^{-1}_{\beta=1}(r) = (\ln r^{-1})^{-1/\alpha}, \quad 0 \le r \le 1$$

So, by Lemma 3.4 (i),

$$|x| = O(F^{-1}_{\beta=0}(r)) = O(2^{-1/\alpha} (\ln r^{-1})^{-1/\alpha}), \quad 0 \le r \le 1$$

Therefore $\log_2 |x| = O(\frac{1}{\alpha} \log \ln r^{-1})$. The dependence on $\alpha$ is $O(\alpha^{-1})$, and $\alpha$ was set in Theorem 3.2 as $\alpha = \epsilon / \log n$. So representing $x$ requires $O(\log n / \epsilon)$ bits. Each entry of $z$ is formed by summing many such variables. The total number of summations is bounded by $Mn$. So the total space to represent each entry of $z$ is

$$\log z_k = O(\log Mnx) = O(\log M + \log n + \log n / \epsilon)$$

The total space required for all $O(\frac{1}{\epsilon^2} \log \frac{1}{\delta})$ entries of $z$ is $O(\frac{1}{\epsilon^3} \log \frac{1}{\delta} \log n)$ if we assume $M$ is polynomially bounded by $n$.

3.4 Per Item Processing Time

For each item, we must compute several sums of variables drawn from stable distributions.
Directly doing this will take time proportional to $a_{i,j}$. We could precompute sums of the necessary variables, but we wish to avoid explicitly storing any values of variables to ensure that the space requirement remains sublinear. However, the defining property of stable distributions is that the sum of any number of variables is distributed as a stable distribution.

Lemma 3.6 The sum $\sum_{k=1}^{a_{i,j}} x_{i,k}$ can be approximated up to a factor of 2 in $\log a_{i,j}$ steps.

Proof: Suppose we replaced each $a_{i,j}$ with the smallest power of 2 that is at least as big, that is, with $2^{\lceil \log_2 a_{i,j} \rceil}$. If we find dommax of this modified stream, then this will be a 2-approximation of the true answer. Applying this rounding procedure, we can find the sum of variables drawn from stable distributions by the identity:

$$\sum_{k=1}^{2^s} x_{i,k} = x_{i,1} + x_{i,2} + \sum_{k=3}^{4} x_{i,k} + \sum_{k=5}^{8} x_{i,k} + \ldots + \sum_{k=2^{s-1}+1}^{2^s} x_{i,k} = x_{i,1} + \sum_{a=0}^{s-1} \sum_{k=2^a+1}^{2^{a+1}} x_{i,k}$$

Each inner sum is a sum of $2^a$ variables with stable distribution. From Definition 2.1, we know that $\sum_{k=3}^{4} x_{i,k} \sim S(\alpha, 0) + S(\alpha, 0) \sim 2^{1/\alpha} S(\alpha, 0)$. Generalising,

$$\sum_{k=2^a+1}^{2^{a+1}} x_{i,k} \sim 2^{a/\alpha}\, S(\alpha, 0)$$

Hence to find the desired sum, we only need to find $O(\log a_{i,j})$ values drawn from $S(\alpha, 0)$, and compute their scaled sum. By using the pseudo-random method in Lemma 3.5, we guarantee that the values found will be consistent with other drawings from this distribution.

Lemma 3.7 The sum $\sum_{k=1}^{a_{i,j}} x_{i,k}$ can be approximated up to a factor of $1 + \epsilon$ in $O(\frac{1}{\epsilon} \log a_{i,j})$ steps.

Proof: To give a $1 + \epsilon$ approximation instead of the above 2-approximation, we imagine rounding each $a_{i,j}$ to the closest value of $\lfloor (1 + \epsilon)^s \rfloor$, guaranteeing an answer that is no more than $(1 + \epsilon)$ times the true value. So we compute sums of the form

$$\left( \sum_{k=\lfloor (1+\epsilon)^s \rfloor + 1}^{\lfloor (1+\epsilon)^{s+1} \rfloor} x_{i,k} \right) \sim \left( \lfloor (1 + \epsilon)^{s+1} \rfloor - \lfloor (1 + \epsilon)^s \rfloor \right)^{1/\alpha} S(\alpha, 0)$$

The sum can be computed in $\log_{1+\epsilon} a_{i,j} = \frac{\log a_{i,j}}{\log (1 + \epsilon)} = O(\frac{1}{\epsilon} \log a_{i,j})$ steps.

More advanced methods can further reduce the number of steps required to compute the range sums of variables drawn from stable distributions (Corollary 1).
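The doubling argument of Lemma 3.6 can be sketched in code. The example below splits the range $1 \ldots 2^s$ into the blocks $\{1\}, \{2\}, \{3,4\}, \{5,\ldots,8\}, \ldots$ and replaces each block of width $w$ by a single draw scaled by $w^{1/\alpha}$. It also includes the transform of Lemma 3.3 as a plain function of its two uniform inputs. The function names and the `draw` callback are assumptions for illustration; the demonstration uses $\alpha = 1$ because, with ordinary floating point, the scale factors overflow for the very small $\alpha$ used in the analysis, so a real implementation would need extended-range arithmetic.

```python
import math


def stable_from_uniforms(alpha, u, theta):
    """Transform of Lemma 3.3: u in (0, 1), theta in (-pi/2, pi/2)."""
    left = math.sin(alpha * theta) / math.cos(theta) ** (1.0 / alpha)
    ratio = math.cos((1.0 - alpha) * theta) / (-math.log(u))
    return left * ratio ** ((1.0 - alpha) / alpha)


def dyadic_blocks(s):
    """Blocks (low, high) covering 1..2^s: {1}, {2}, {3,4}, {5..8}, ..."""
    blocks = [(1, 1)]
    for a in range(s):
        blocks.append((2 ** a + 1, 2 ** (a + 1)))
    return blocks


def rounded_range_sum(value, alpha, draw):
    """Stand-in for the sum of x_k, k = 1..value, with value rounded up to 2^s.

    draw(block_number) must return a reproducible sample from S(alpha, 0);
    a block of width w is replaced by w^(1/alpha) times one such sample.
    """
    s = (value - 1).bit_length()          # ceil(log2(value)) for value >= 1
    total = 0.0
    for number, (low, high) in enumerate(dyadic_blocks(s)):
        width = high - low + 1
        total += width ** (1.0 / alpha) * draw(number)
    return total


print(dyadic_blocks(3))
# With alpha = 1 and every draw equal to 1, the result is the rounded length.
print(rounded_range_sum(5, 1.0, lambda number: 1.0))
```

Only $s + 1$ draws are made for a value rounded up to $2^s$, compared with $2^s$ draws for the direct summation.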
Corollary 1 The sum $\sum_{k=1}^{a_{i,j}} x_{i,k}$ can be computed in $O(\log a_{i,j})$ steps using conditional stable distributions.

In outline, the procedure does a binary search to reach $a_{i,j}$, hence takes $\log a_{i,j}$ steps. However, rather than drawing from stable distributions directly, when halving an interval we need to draw from a stable distribution conditioned on the fact that we know the sum of that interval. This requires an analogue of Lemma 1 of [GGI+02], and we do not discuss this approach further here since it is not necessary for the result as stated.

The main Theorem 3.1 follows as a consequence of combining Theorem 3.2 with Lemmas 3.5 and 3.7, with appropriate rescaling of $\epsilon$.

4 Hardness of other Dominance Norms

4.1 Min-Dominance

We recall the definition of the min-dominance, $\sum_i \min_j \{a_{i,j}\}$. We show that, unlike the max-dominance norm, it is not possible to compute a useful approximation to this quantity in the data stream model. This is shown by using a reduction from the size of the intersection of two sets, a problem that is known to be hard to approximate in the communication complexity model.

Theorem 4.1 Any algorithm to compute any constant factor approximation to $\mathrm{dommin}(a)$ requires $\Omega(n)$ bits of storage.

Proof: We show a simple transformation to a problem known to be hard in the communication complexity model. Consider two people, A and B, who each hold a subset ($X$ and $Y$ respectively) of the integers 1 to $n$. Suppose that there were some algorithm which allowed the approximation of the dommin function, which is known to both A and B. Then the following interaction takes place: A computes the following data stream $\mathrm{Str}(X)$ from the set $X$:

$$\mathrm{Str}(X)_i = (i, 1) \iff i \in X$$
$$\mathrm{Str}(X)_i = (i, 0) \iff i \notin X$$

and passes this stream to the algorithm to process. Following this, A takes the complete memory state of the algorithm and communicates it to B. B then proceeds to run the same algorithm based on this memory state applied to the same transformation of $Y$, that is, $\mathrm{Str}(Y)$.
The result is an approximation of dommin for the concatenation of the streams $\mathrm{Str}(X)$ and $\mathrm{Str}(Y)$. But $\mathrm{dommin}(\mathrm{Str}(X) \,\|\, \mathrm{Str}(Y)) = |X \cap Y|$; in other words, the sum of the minimum values is exactly the intersection size of the two sets. In the communication complexity model, any algorithm that can approximate the size of the intersection of two sets to any constant factor with constant probability must communicate $\Omega(n)$ bits of information. This is a consequence of the hardness of the disjointness problem [Raz92, KN97]. Therefore, the size of the memory state of the proposed streaming algorithm must be $\Omega(n)$ bits.

Note that this is as strong a hardness result as we can hope for. The function dommin can be computed exactly using $O(n)$ space by storing the stream explicitly. The stream provided has as many convenient properties as we can arrange without making the problem trivial: zero values are represented explicitly, each index occurs exactly twice [3], and all the values seen are either zero or one.

Footnote 3: Note that if each index occurs at most once then trivially $\mathrm{dommin}(a) = \mathrm{dommax}(a) = \sum_i a_{i,1}$.

4.2 Other Dominance and Relative Norms

Finding the accumulation of any averaging function (mean, median or mode) of a mixture of signals requires as much storage as there are different signals.

Theorem 4.2 Computing $\sum_i (\sum_j a_{i,j} / n_i)$ on the stream to any constant factor $c$ requires $\Omega(n/c)$ bits of storage.

Proof: We pick $X \subseteq \{1 \ldots m\}$ such that $|X| = m/2$. We feed $\mathrm{Str}(X)$ into the algorithm, and take a copy of the memory contents. We can then pick an arbitrary index $k$ and send a stream of values $\forall i \ne k,\ 1 \le j < cm: (i, 0)$ to the algorithm, that is, $cm - 1$ zero-valued tuples for every index other than $k$. Then

$$\sum_i a_i / n_i = a_k + \frac{m/2 - a_k}{cm} = \begin{cases} \dfrac{1}{2c} & \iff a_k = 0 \\[2mm] 1 + \dfrac{1}{c}\left(\dfrac{1}{2} - \dfrac{1}{m}\right) & \iff a_k = 1 \end{cases}$$

The ratio of these two quantities is

$$\frac{1}{2c} : 1 + \frac{1}{c}\left(\frac{1}{2} - \frac{1}{m}\right)$$

and if $m \ge 4$ then this is at least

$$\frac{1}{2c} : \frac{4c + 1}{4c} = 1 : \frac{4c + 1}{2}$$

If we could approximate to better than this ratio, then we could distinguish the two cases, and so we could retrieve whether any $k \in X$ with constant probability.
Therefore, the memory requirement is $\Omega(m) = \Omega(n/c)$. We are free to set $c$ as large as we like, showing in particular that any constant factor approximation requires $\Omega(n)$ memory use.

We summarize the hardness results for other dominance norms, which are mostly similar to those presented above. Approximating the median or mode dominance ($\sum_i \mathrm{median}_j\{a_{i,j}\}$ and $\sum_i \mathrm{mode}_j\{a_{i,j}\}$ respectively) requires $\Omega(n)$ space, by a reduction from intersection size. Relative norms, which consist of two distinct streams $a$ and $b$, also prove to be difficult. Approximating the relative sum norm $\sum_i a_i / \max\{1, b_i\}$ to any constant $c$ with constant probability requires $\Omega(n/c)$ bits of storage. We omit the proof for brevity: it essentially follows the same pattern as Theorem 4.2. Likewise, approximating the relative count $|\{i \mid a_i > b_i\}|$ is as hard as approximating set difference, $|X \setminus Y|$, which also requires $\Omega(n)$ bits.

5 Conclusion

Data streams often consist of multiple signals. We initiated the study of estimating dominance norms over multiple signals. We presented an algorithm for estimating the max-dominance of the multiple signals that uses small (poly-logarithmic) space and takes small time per operation. This is the first known algorithm for any dominance norm in the data stream model. In contrast, we showed that related quantities such as the min-dominance cannot be so approximated.

We have already discussed some of the applications of max-dominance, and we expect it to find many other uses, as such, and variations thereof. The analysis that we give to demonstrate the behavior of stable distributions with small index parameter $\alpha$, and our procedure for summing large ranges of such variables very quickly, may spur the discovery of further applications of these remarkable distributions.

References

[AF91] A. Amir and M. Farach. Efficient 2-dimensional approximate matching of non-rectangular figures. In Proceedings of the Second Annual ACM-SIAM Symposium on Discrete Algorithms, pages 212–223, 1991.

[AMS96] N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the frequency moments. In Proceedings of the Twenty-Eighth Annual ACM Symposium on the Theory of Computing, pages 20–29, 1996.

[CCFC02] M. Charikar, K. Chen, and M. Farach-Colton. Finding frequent items in data streams. In Proceedings of 29th International Colloquium on Automata, Languages and Programming, 2002.

[CDIM02] G. Cormode, M. Datar, P. Indyk, and S. Muthukrishnan. Comparing data streams using Hamming norms. In Proceedings of 28th International Conference on Very Large Data Bases, 2002.

[CIKM02] G. Cormode, P. Indyk, N. Koudas, and S. Muthukrishnan. Fast mining of tabular data via approximate distance computations. In Proceedings of the International Conference on Data Engineering, 2002.

[CMS76] J.M. Chambers, C.L. Mallows, and B.W. Stuck. A method for simulating stable random variables. Journal of the American Statistical Association, 71(354):340–344, 1976.

[Cre75] N. Cressie. A note on the behaviour of the stable distributions for small index $\alpha$. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 33:61–64, 1975.

[Ele] http://energycrisis.lbl.gov/.

[FKSV99] J. Feigenbaum, S. Kannan, M. Strauss, and M. Viswanathan. An approximate L1-difference algorithm for massive data streams. In Proceedings of the 40th Annual Symposium on Foundations of Computer Science, pages 501–511, 1999.

[GGI+02] A. Gilbert, S. Guha, P. Indyk, Y. Kotidis, S. Muthukrishnan, and M. Strauss. Fast, small-space algorithms for approximate histogram maintenance. In Proceedings of the 34th ACM Symposium on Theory of Computing, 2002.

[GKMS01] A. Gilbert, Y. Kotidis, S. Muthukrishnan, and M. Strauss. Surfing wavelets on streams: One-pass summaries for approximate aggregate queries. In Proceedings of 27th International Conference on Very Large Data Bases, 2001.

[GKS01] S. Guha, N. Koudas, and K. Shim. Data streams and histograms. In Proceedings of Symposium on Theory of Computing, pages 471–475, 2001.

[GM99] P. Gibbons and Y. Matias. Synopsis structures for massive data sets. DIMACS Series in Discrete Mathematics and Theoretical Computer Science, A, 1999.

[HRR98] M. Henzinger, P. Raghavan, and S. Rajagopalan. Computing on data streams. Technical Report SRC 1998-011, DEC Systems Research Centre, 1998.

[Ind00] P. Indyk. Stable distributions, pseudorandom generators, embeddings and data stream computation. In Proceedings of the 40th Symposium on Foundations of Computer Science, 2000.

[JL84] W.B. Johnson and J. Lindenstrauss. Extensions of Lipschitz mapping into Hilbert space. Contemporary Mathematics, 26:189–206, 1984.

[KN97] E. Kushilevitz and N. Nisan. Communication Complexity. Cambridge University Press, 1997.

[LSC] Large-scale communication networks: Topology, routing, traffic, and control. http://www.ipam.ucla.edu/programs/cntop/cntop_schedule.html.

[Mot02] R. Motwani. Models and issues in data stream systems. In PODS Plenary Talk at the ACM SIGMOD/PODS 2002 Conference, 2002.

[Nis92] N. Nisan. Pseudorandom generators for space-bounded computation. Combinatorica, 12, 1992.

[NOA] NOAA. National Oceanic and Atmospheric Administration, U.S. National Weather Service. http://www.nws.noaa.gov/.

[Raz92] A. A. Razborov. On the distributional complexity of disjointness. Theoretical Computer Science, 106(2):385–390, 1992.

[Rou02] M. Roughan. http://swallow.research.att.com/roughan/snap.html, 2002.

[Sec] http://securities.stanford.edu/litigation_activity.html.

[TIGK02] N. Thaper, P. Indyk, S. Guha, and N. Koudas. Dynamic multidimensional histograms. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 2002.

[UZ99] V. V. Uchaikin and V. M. Zolotarev. Chance and Stability: Stable Distributions and their applications. VSP, 1999.

Publication date: 2002
